Co-processing synchronizing techniques on heterogeneous graphics processing units

ABSTRACT

The graphics co-processing technique includes receiving a display operation for execution by a graphics processing unit on an unattached adapter. The display operation is split into a copy from a frame buffer of the graphics processing unit on the unattached adapter to a buffer in system memory, a copy from the buffer in system memory to a frame buffer of the graphics processing unit on a primary adapter, and a present from the frame buffer of the graphics processing unit on the primary adapter to a display. Execution of the copy from the frame buffer of the graphics processing unit on the unattached adapter to the buffer in system memory and the copy from the buffer in system memory to the frame buffer of the graphics processing unit on the primary adapter are synchronized.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 61/243,155 filed Sep. 16, 2009 and U.S. Provisional Patent Application No. 61/243,164 filed Sep. 17, 2009.

BACKGROUND OF THE INVENTION

Conventional computing systems may include a discrete graphics processing unit (dGPU) or an integral graphics processing unit (iGPU). The discrete GPU and integral GPU are heterogeneous because of their different designs. The integrated GPU generally has relatively poor processing performance compared to the discrete GPU. However, the integrated GPU generally consumes less power compared to the discrete GPU.

The conventional operating system does not readily support co-processing using such heterogeneous GPUs. Referring to FIG. 1, a graphics processing technique according to the conventional art is shown. When an application 110 starts, it calls the user mode level runtime application programming interface (e.g., DirectX API d3d9.dll) 120 to determine what display adapters are available. In response, the runtime API 120 enumerates the adapters that are attached to the desktop (e.g., the primary display 180). A display adapter 165, 175, even if recognized and initialized by the operating system, will not be enumerated in the adapter list by the runtime API 120 if it is not attached to the desktop. The runtime API 120 loads the device driver interface (DDI) (e.g., user mode driver (umd.dll)) 130 for the GPU 170 attached to the primary display 180. The runtime API 120 of the operating system will not load the DDI of the discrete GPU 175 because the discrete GPU 175 is not attached to the display adapter. The DDI 130 configures command buffers of the graphics processor 170 attached to the primary display 180. The DDI 130 will then call back to the runtime API 120 when the command buffers have been configured.

Thereafter, the application 110 makes graphics requests to the user mode level runtime API (e.g., DirectX API d3d9.dll) 120 of the operating system. The runtime 120 sends graphics requests to the DDI 130 which configures command buffers. The DDI calls to the operating system kernel mode driver (e.g., DirectX driver dxgkrnl.sys) 150, through the runtime API 120, to schedule the graphics request. The operating system kernel mode driver then calls to the device specific kernel mode driver (e.g., kmd.sys) 160 to set the command register of the GPU 170 attached to the primary display 180 to execute the graphics requests from the command buffers. The device specific kernel mode driver 160 controls the GPU 170 (e.g., integral GPU) attached to the primary display 180.

Therefore, there is a need to enable co-processing on heterogeneous GPUs. For example, it may be desired to use a first GPU to perform graphics processing for a first class of applications and a second GPU for a second class of applications depending upon processing performance and power consumption parameters.

SUMMARY OF THE INVENTION

Embodiments of the present technology are directed toward graphics co-processing. The present technology may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the present technology.

In one embodiment, a graphics co-processing method includes injecting an application initialization routine, when an application starts, that includes an entry point that changes a search path for a display device interface to a search path of a shim layer library, and that includes an entry point that identifies the application. As a result, the shim layer library is loaded. The shim layer library initializes a display device interface for a first graphics processing unit on a primary adapter and a display device interface for a second graphics processing unit on an unattached adapter, wherein the display device interface on the unattached adapter is initialized without calling back to a runtime application programming interface. In addition, the shim layer library determines if the application has an affinity for execution of graphics commands on the second graphics processing unit. The shim layer also splits a display command, if there is an affinity, into a copy from a frame buffer of the second graphics processing unit to a buffer in system memory, a copy from the buffer in system memory to a frame buffer of the first graphics processing unit, and a present from the frame buffer of the first graphics processing unit on a display on the primary adapter.

In another embodiment, a graphics co-processing method includes loading a device specific kernel mode driver of a second graphics processing unit tagged as a non-graphics device. A device driver interface and a device specific kernel mode driver for a first graphics processing unit on a primary adapter are loaded and initialized. A device driver interface for the second graphics processing unit on a non-graphics device tagged adapter is loaded and initialized without the device driver interface talking back to a runtime application programming interface when a particular version of an operating system will not otherwise allow the device specific kernel mode driver for the second graphics processing unit to be loaded. Thereafter, a display command is split into a copy from a frame buffer of the second graphics processing unit to a buffer in system memory, a copy from the buffer in system memory to a frame buffer of the first graphics processing unit, and a present from the frame buffer of the first graphics processing unit on a display on the primary adapter. The display device interface on the unattached adapter is called to configure command buffers to copy from the frame buffer of the second graphics processing unit to the buffer in the system memory, when the graphics command comprises a display command. The operating system kernel mode driver is called to schedule execution of the command buffers for the copy from the frame buffer of the second graphics processing unit to the buffer in system memory, when the graphics command comprises a display command. The device specific kernel mode driver is called to set command registers of the second graphics processing unit to copy from the frame buffer of the second graphics processing unit to the buffer in system memory, when the graphics command comprises a display command.
The display device interface on the primary adapter is called to configure command buffers to copy from the buffer in system memory to a frame buffer of the first graphics processing unit, when the graphics command comprises a display command. The operating system kernel mode driver is called to schedule execution of the copy from the buffer in system memory to the frame buffer of the first graphics processing unit, when the graphics command comprises a display command. The device specific kernel mode driver is called to set command registers of the first graphics processing unit for the copy from the buffer in system memory to the frame buffer of the first graphics processing unit, when the graphics command comprises a display command. The display device interface on the unattached adapter is called to configure command buffers to present from the frame buffer of the first graphics processing unit, when the graphics command comprises a display command. The operating system kernel mode driver is called to schedule execution of the present command, when the graphics command comprises a display command. The device specific kernel mode driver is called to set command registers of the first graphics processing unit to present, when the graphics command comprises a display command.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present technology are illustrated by way of example and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIG. 1 shows a graphics processing technique according to the conventional art.

FIG. 2 shows a graphics co-processing computing platform, in accordance with one embodiment of the present technology.

FIG. 3 shows a graphics co-processing technique, in accordance with one embodiment of the present technology.

FIG. 4 shows a graphics co-processing technique, in accordance with another embodiment of the present technology.

FIG. 5 shows a method of synchronizing copy and present operations on a first and second GPU, in accordance with one embodiment of the present technology.

FIG. 6 shows an exemplary set of render and display operations, in accordance with one embodiment of the present technology.

FIG. 7 shows an exemplary set of render and display operations, in accordance with another embodiment of the present technology.

FIG. 8 shows a method of compressing rendered data, in accordance with one embodiment of the present technology.

FIG. 9 shows an exemplary desktop 910 including an exemplary graphical user interface for selection of the GPU to run a given application, in accordance with one embodiment of the present technology.

FIG. 10 shows a graphics co-processing technique, in accordance with another embodiment of the present technology.

DETAILED DESCRIPTION OF THE INVENTION

Reference will now be made in detail to the embodiments of the present technology, examples of which are illustrated in the accompanying drawings. While the present technology will be described in conjunction with these embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present technology, numerous specific details are set forth in order to provide a thorough understanding of the present technology. However, it is understood that the present technology may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present technology.

Embodiments of the present technology introduce a shim layer between the runtime API (e.g., DirectX) and the device driver interface (DDI) (e.g., user mode driver (UMD)) to separate the display commands from the rendering commands, allowing retargeting of rendering commands to an adapter other than the adapter the application is displaying on. In one implementation, the shim layer allows the DDI layer to redirect a runtime (e.g., Direct3D (D3D)) default adapter creation to an off-screen graphics processing unit (GPU), such as a discrete GPU, not attached to the desktop. The shim layer effectively layers the device driver interface, and therefore does not hook a system component.

Referring to FIG. 2, a graphics co-processing computing platform, in accordance with one embodiment of the present technology, is shown. The exemplary computing platform may include one or more central processing units (CPUs) 205, a plurality of graphics processing units (GPUs) 210, 215, volatile and/or non-volatile memory (e.g., computer readable media) 220, 225, one or more chip sets 230, 235, and one or more peripheral devices 215, 240-260 communicatively coupled by one or more busses. The GPUs include heterogeneous designs. In one implementation, a first GPU may be an integral graphics processing unit (iGPU) and a second GPU may be a discrete graphics processing unit (dGPU). The chipset 230, 235 acts as a simple input/output hub for communicating data and instructions between the CPU 205, the GPUs 210, 215, the computing device-readable media 220, 225, and peripheral devices 215, 240-265. In one implementation, the chipset includes a northbridge 230 and southbridge 235. The northbridge 230 provides for communication between the CPU 205, system memory 220 and the southbridge 235. In one implementation, the northbridge 230 includes an integral GPU. The southbridge 235 provides for input/output functions. The peripheral devices 215, 240-265 may include a display device 240, a network adapter (e.g., Ethernet card) 245, CD drive, DVD drive, a keyboard, a pointing device, a speaker, a printer, and/or the like. In one implementation, the second graphics processing unit is coupled as a discrete GPU peripheral device 215 by a bus such as a Peripheral Component Interconnect Express (PCIe) bus.

The computing device-readable media 220, 225 may be characterized as primary memory and secondary memory. Generally, the secondary memory, such as a magnetic and/or optical storage, provides for non-volatile storage of computer-readable instructions and data for use by the computing device. For instance, the disk drive 225 may store the operating system (OS), applications and data. The primary memory, such as the system memory 220 and/or graphics memory, provides for volatile storage of computer-readable instructions and data for use by the computing device. For instance, the system memory 220 may temporarily store a portion of the operating system, a portion of one or more applications and associated data that are currently used by the CPU 205, GPU 210 and the like. In addition, the GPUs 210, 215 may include integral or discrete frame buffers 211, 216.

Referring to FIG. 3, a graphics co-processing technique, in accordance with one embodiment of the present technology, is shown. When an application 110 starts, it calls the user mode level runtime application programming interface (e.g., DirectX API d3d9.dll) 120 to determine what display adapters are available. In addition, an application initialization routine is injected when the application starts. In one implementation, the application initialization routine is a short dynamic link library (e.g., appin.dll). The application initialization routine injected in the application includes some entry points, one of which includes a call (e.g., set_dll_searchpath( )) to change the search path for the display device driver interface. During initialization, the search path for the device driver interface (e.g., c:\windows\system32\. . . \umd.dll) is changed to the search path of a shim layer library (e.g., c:\ . . . \coproc\ . . . \umd.dll). Therefore the runtime API 120 will search for the same DDI name but in a different path, which will result in the runtime API 120 loading the shim layer 125.

The shim layer library 125 has the same entry points as a conventional display driver interface (DDI). The runtime API 120 passes one or more function pointers to the shim layer 125 when calling into the applicable entry point (e.g., OpenAdapter( )) in the shim layer 125. The function pointers passed to the shim layer 125 are call backs into the runtime API 120. The shim layer 125 stores the function pointers. The shim layer 125 loads and initializes the DDI on the primary adapter 130. The DDI on the primary adapter 130 returns a data structure pointer to the shim layer 125 representing the attached adapter. The shim layer 125 also loads and initializes the device driver interface on the unattached adapter 135 by passing two function pointers which are call backs into local functions of the shim layer 125. The DDI on the unattached adapter 135 also returns a data structure pointer to the shim layer 125 representing the unattached adapter. The data structure pointers returned by the DDI on the primary adapter 130 and unattached adapter 135 are stored by the shim layer 125. The shim layer 125 returns to the runtime API 120 a pointer to a composite data structure that contains the two handles. Accordingly, the DDI on the unattached adapter 135 is able to initialize without talking back to the runtime API 120.
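The initialization sequence described above can be illustrated with a minimal Python sketch. It is not driver code: the class names (StubDDI, ShimLayer, AdapterHandle, CompositeHandle), the callback names "allocate" and "submit", and the choice to forward the stored runtime callbacks to the DDI on the primary adapter are all assumptions made for illustration. The point shown is that the DDI on the unattached adapter receives only callbacks local to the shim layer, and that the runtime gets back one composite structure holding both handles.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Callbacks = Dict[str, Callable[[], str]]


@dataclass
class AdapterHandle:
    adapter: str  # hypothetical stand-in for the data structure pointer a DDI returns


@dataclass
class CompositeHandle:
    primary: AdapterHandle
    unattached: AdapterHandle


class StubDDI:
    """Hypothetical stand-in for a device driver interface."""

    def __init__(self, adapter: str) -> None:
        self.adapter = adapter
        self.callbacks: Callbacks = {}

    def open_adapter(self, callbacks: Callbacks) -> AdapterHandle:
        # Remember which callbacks this DDI was initialized with.
        self.callbacks = callbacks
        return AdapterHandle(self.adapter)


class ShimLayer:
    def __init__(self, primary_ddi: StubDDI, unattached_ddi: StubDDI) -> None:
        self.primary_ddi = primary_ddi
        self.unattached_ddi = unattached_ddi
        self.runtime_callbacks: Callbacks = {}
        self.handles: Optional[CompositeHandle] = None

    # Local functions handed to the DDI on the unattached adapter.
    def _local_allocate(self) -> str:
        return "handled by shim: allocate"

    def _local_submit(self) -> str:
        return "handled by shim: submit"

    def open_adapter(self, runtime_callbacks: Callbacks) -> CompositeHandle:
        # Store the call backs into the runtime API.
        self.runtime_callbacks = dict(runtime_callbacks)
        # Assumption: the primary DDI is initialized with the runtime callbacks.
        primary = self.primary_ddi.open_adapter(self.runtime_callbacks)
        # The unattached DDI only ever sees two shim-local callbacks.
        local: Callbacks = {
            "allocate": self._local_allocate,
            "submit": self._local_submit,
        }
        unattached = self.unattached_ddi.open_adapter(local)
        # Return a composite structure containing both handles.
        self.handles = CompositeHandle(primary, unattached)
        return self.handles
```

Because the unattached DDI never holds a runtime callback in this sketch, any call it makes during initialization lands in the shim layer rather than in the runtime.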

In one implementation, the shim layer 125 is an independent library. The independent shim layer may be utilized when the primary GPU/display and the secondary GPU are provided by different vendors. In another implementation, the shim layer 125 may be integral to the display device interface on the unattached adapter. The shim layer integral to the display device driver may be utilized when the primary GPU/display and secondary GPU are from the same vendor.

The application initialization routine (e.g., appin.dll) injected in the application also includes other entry points, one of which includes an application identifier. In one implementation, the application identifier may be the name of the application. The shim layer 125 makes a call to the injected application initialization routine (e.g., appin.dll) to determine the application identifier when a graphics command is received. The application identifier is compared with the applications in a white list (e.g., a text file). The white list indicates an affinity between one or more applications and the second graphics processing unit. In one implementation, the white list includes one or more applications that would perform better if executed on the second graphics processing unit.
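A white list check of this kind could look like the following Python sketch. The file layout (one application name per line, with '#' comment lines) and the case-insensitive comparison are assumptions made for illustration; the description above only states that the white list may be a text file and that the identifier may be the application name.

```python
from typing import Set


def parse_white_list(text: str) -> Set[str]:
    """Parse a white list with one application identifier per line (assumed layout)."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' comments (the comment syntax is an assumption).
        if line and not line.startswith("#"):
            entries.add(line.lower())
    return entries


def has_affinity(app_identifier: str, white_list: Set[str]) -> bool:
    """Return True if the application should render on the second GPU."""
    return app_identifier.strip().lower() in white_list
```

For example, with a white list containing the hypothetical entries "game.exe" and "viewer.exe", has_affinity("Game.exe", ...) would select the second GPU, while an unlisted application would stay on the first GPU.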

If the application identifier is not on the white list, the shim layer 125 calls the device driver interface on the primary adapter 130. The device driver interface on the primary adapter 130 sets the command buffers. The device driver interface on the primary adapter then calls, through the runtime 120 and a thunk layer 140, to the operating system kernel mode driver (e.g., DirectX driver dxgkrnl.sys) 150. The operating system kernel mode driver 150 in turn schedules the graphics command with the device specific kernel mode driver (e.g., kmd.sys) 160 for the GPU 210 attached to the primary display 240. The GPU 210 attached to the primary display 240 is also referred to hereinafter as the first GPU. The device specific kernel mode driver 160 sets the command register of the GPU 210 to execute the graphics command on the GPU 210 (e.g., integral GPU) attached to the primary display 240.

If the application identifier is a match to one or more identifiers on the white list, the handle from the runtime API 120 is swapped by the shim layer 125 with functions local to the shim layer 125. For a rendering command, the local function stored in the shim layer 125 will call into the DDI on the unattached adapter 135 to set the command buffer. In response, the DDI on the unattached adapter 135 will call local functions in the shim layer 125 that route the call through the thunk layer 140 to the operating system kernel mode driver 150 to schedule the rendering command. The operating system kernel mode driver 150 calls the device specific kernel mode driver (e.g., dkmd.sys) 165 for the GPU on the unattached adapter 215 to set the command registers. The GPU on the unattached adapter 215 (e.g., discrete GPU) is also referred to hereinafter as the second GPU. Alternatively, the DDI on the unattached adapter 135 can call local functions in the thunk layer 140. The thunk layer 140 routes the graphics request to the operating system kernel mode driver (e.g., DirectX driver dxgkrnl.sys) 150. The operating system kernel mode driver 150 schedules the graphics command with the device specific kernel mode driver (e.g., dkmd.sys) 165 on the unattached adapter. The device specific kernel mode driver 165 controls the GPU on the unattached adapter 215.

For a display related command (e.g., Present( )), the shim layer 125 splits the display related command received from the application 110 into a set of commands for execution by the GPU on the unattached adapter 215 and another set of commands for execution by the GPU on the primary adapter 210. In one implementation, when the shim layer 125 receives a present call from the runtime 120, the shim layer 125 calls to the DDI on the unattached adapter 135 to cause a copy from the frame buffer 216 of the GPU on the unattached adapter 215 to a corresponding buffer in system memory 220. The shim layer 125 will also call the DDI on the primary adapter 130 to cause a copy from the corresponding buffer in system memory 220 to the frame buffer 211 of the GPU on the attached adapter 210 and then a present by the GPU on the attached adapter 210. The memory accesses between the frame buffers 211, 216 and system memory 220 may be direct memory accesses (DMA). To synchronize the copy and present operations on the GPUs 210, 215, a display thread is created that is notified when the copy to system memory by the second GPU 215 is done. The display thread will then queue the copy from system memory 220 and the present call into the GPU on the attached adapter 210.
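The split of a present call into per-GPU command sets can be sketched as follows. This is an illustrative Python model only; the Command tuple and the buffer labels are hypothetical names, and real command buffers would be configured through the respective DDIs rather than built as tuples.

```python
from typing import List, NamedTuple, Tuple


class Command(NamedTuple):
    op: str   # "copy" or "present"
    src: str  # source buffer label (hypothetical naming)
    dst: str  # destination buffer label (hypothetical naming)


def split_present(frame: int) -> Tuple[List[Command], List[Command]]:
    """Split one present call into commands for the second and first GPU."""
    staging = f"system_memory[{frame}]"
    # Executed by the GPU on the unattached adapter (second GPU).
    second_gpu = [Command("copy", "second_gpu_frame_buffer", staging)]
    # Executed by the GPU on the primary adapter (first GPU), in this order.
    first_gpu = [
        Command("copy", staging, "first_gpu_frame_buffer"),
        Command("present", "first_gpu_frame_buffer", "primary_display"),
    ]
    return second_gpu, first_gpu
```

The first GPU's commands must not start until the second GPU's copy into the shared staging buffer has finished, which is the synchronization problem the display thread addresses.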

In another implementation, the operating system (e.g., Windows 7 Starter) will not load a second graphics driver 165. Referring now to FIG. 4, a graphics co-processing technique, in accordance with another embodiment of the present technology, is shown. When the operating system will not load a second graphics driver, the second GPU 475 is tagged as a non-graphics device adapter that has its own driver 465. Therefore the second GPU 475 and its device specific kernel mode driver 465 are not seen by the operating system as a graphics adapter. In one implementation, the second GPU 475 and its driver 465 are tagged as a memory controller. The shim layer 125 loads and configures the DDI 130 for the first GPU 210 on the primary adapter and the DDI 135 for the second GPU 475. If there is a specified affinity for executing rendering commands from the application 110 on the second GPU 475, the shim layer 125 intercepts the rendering commands sent by the runtime API 120 to the DDI on the primary adapter 130, calls the DDI on the unattached adapter to set the command buffers for the second GPU 475, and routes them to the driver 465 for the second GPU 475. The shim layer 125 also intercepts the callbacks from the driver 465 for the second GPU 475 to the runtime 120. In another implementation, the shim layer 125 implements the DDI 135 for the second GPU 475. Accordingly, the shim layer 125 splits graphics commands and redirects them to the two DDIs 130, 135.

Accordingly, the embodiments described with reference to FIG. 3 enable the application to run on a second GPU instead of a first GPU when the particular version of the operating system will allow the driver for the second GPU to be loaded but the runtime API will not allow a second device driver interface to be initialized. The embodiments described with reference to FIG. 4 enable an application to run on a second GPU, such as a discrete GPU, instead of a first GPU, such as an integrated GPU, when the particular version of the operating system (e.g., Windows 7 Starter) will not allow the driver for the second GPU to be loaded. The DDI 135 for the second GPU 475 cannot talk back through the runtime 120 or the thunk layer 140 to a graphics adapter handled by an OS specific kernel mode driver.

Referring now to FIG. 5, a method of synchronizing the copy and present operations on the first and second GPUs is shown. The method is illustrated in FIG. 6 with reference to an exemplary set of render and display operations, in accordance with one embodiment of the present technology. At 510, the shim layer 125 receives a plurality of rendering 605-615 and display operations for execution by the GPU on the unattached adapter 215. At 520, the shim layer 125 splits each display operation into a set of commands including 1) a copy 620-630 from a frame buffer 216 of the GPU on the unattached adapter 215 to a corresponding buffer in system memory 220 having shared access with the GPU on the attached adapter 210, 2) a copy 635, 640 from the buffer in shared system memory 220 to a frame buffer of the GPU on the primary adapter 210, and 3) a present 645, 650 on the primary display 240 by the GPU on the primary adapter 210. At 530, the copy and present operations on the first and second GPUs 210, 215 are synchronized.

The frame buffers 211, 216 and shared system memory 220 may be double or ring buffered. In a double buffered implementation, the current rendering operation is stored in a given one of the double buffers 605 and the other one of the double buffers is blitted to a corresponding given one of the double buffers of the system memory. When the rendering operation is complete, the next rendering operation is stored in the other one of the double buffers and the content of the given one of the double buffers is blitted 620 to the corresponding other one of the double buffers of the system memory. The rendering and blitting alternate back and forth between the buffers of the frame buffer of the second GPU 215. The blit to system memory is executed asynchronously. In another implementation, the frame buffer of the second GPU 215 is double buffered and the corresponding buffer in system memory 220 is a three buffer ring buffer.
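The alternation between rendering and blitting in the double buffered case can be written out as a small schedule. This Python sketch is illustrative only; the function name and the tuple layout are assumptions, and it models only which buffer index is in use, not the asynchronous execution of the blit.

```python
from typing import List, Optional, Tuple


def double_buffer_schedule(n_frames: int) -> List[Tuple[int, int, Optional[int]]]:
    """For each frame, return (frame, render_buffer, blit_buffer).

    While frame n is rendered into one of the two buffers, the other buffer,
    which holds frame n-1, is blitted to system memory. Nothing is blitted
    during the very first frame because no earlier frame exists yet.
    """
    schedule = []
    for frame in range(n_frames):
        render_buffer = frame % 2
        blit_buffer = None if frame == 0 else 1 - render_buffer
        schedule.append((frame, render_buffer, blit_buffer))
    return schedule
```

For three frames this yields render targets 0, 1, 0 and blit sources None, 0, 1, so a buffer is never rendered into and blitted from in the same step.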

After the corresponding one of the double buffers of the frame buffer 216 in the second GPU 215 is blitted 620 to the system memory 220, the second GPU 215 generates an interrupt to the OS. In one implementation, the OS is programmed to signal an event to the shim layer 125 in response to the interrupt and the shim layer 125 is programmed to wait on the event before sending a copy command 635 and a present command 645 to the first GPU 210. In a thread separate from the application thread, referred to hereinafter as the display thread, the shim layer waits for receipt of the event indicating that the copy from the frame buffer to system memory is done, referred to hereinafter as the copy event interrupt. A separate thread is used so that the rendering commands on the first and second GPUs 210, 215 are not stalled in the application thread while waiting for the copy event interrupt. The display thread may also have a higher priority than the application thread.
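The role of the display thread can be modeled with ordinary threads. The Python sketch below is a simulation under stated assumptions, not driver code: a queue.Queue stands in for the copy event interrupts signaled by the OS, log entries stand in for GPU commands, and the function and label names are hypothetical. It shows that the application thread keeps issuing render work without blocking, while the copy and present on the first GPU are only queued after the corresponding copy on the second GPU has been reported done.

```python
import queue
import threading
from typing import List, Tuple


def run_frames(n_frames: int) -> List[Tuple[str, int]]:
    """Simulate render, copy, and present ordering with a display thread."""
    log: List[Tuple[str, int]] = []
    # Stand-in for copy event interrupts: each item is a finished frame number.
    copy_done: "queue.Queue" = queue.Queue()

    def display_thread() -> None:
        while True:
            frame = copy_done.get()  # wait for the copy event
            if frame is None:        # sentinel: no more frames
                break
            # Only now are the first GPU's commands queued.
            log.append(("first_gpu_copy", frame))
            log.append(("present", frame))

    worker = threading.Thread(target=display_thread)
    worker.start()

    # Application thread: never waits on the display side.
    for frame in range(n_frames):
        log.append(("render", frame))
        log.append(("second_gpu_copy", frame))
        copy_done.put(frame)  # models the copy event being signaled

    copy_done.put(None)
    worker.join()
    return log
```

The exact interleaving of the two threads varies between runs, but for every frame the second GPU's copy precedes the first GPU's copy, which in turn precedes the present.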

A race condition may occur where the next rendering to a given one of the double buffers for the second GPU 215 begins before the previous copy from the given buffer is complete. In such case, a plurality of copy event interrupts may be utilized. In one implementation, a ring buffer and four events are utilized.

Upon receipt of the copy event interrupt, the display thread queues the blit from system memory 220 and the present call into the first GPU 210. The first GPU 210 blits the given one of the system memory 220 buffers to a corresponding given one of the frame buffers of the first GPU 210. When the blit operation is complete, the content of the given one of the frame buffers of the first GPU 210 is presented on the primary display 240. When the next copy and present commands are received by the first GPU 210, the corresponding other of the system memory 220 buffers is blitted into the other one of the frame buffers of the first GPU 210 and then the content is presented on the primary display 240. The blit and present alternate back and forth between the double buffered frame buffer of the first GPU 210. The copy event interrupt is used to delay programming, thereby effectively delaying the scheduling of the copy from system memory 220 to the frame buffer of the first GPU 210 and presenting on the primary display 240.

In one implementation, a notification on the display side indicates that the frame has been presented on the display 240 by the first GPU 210. The OS is programmed to signal an event when the command buffer causing the first GPU 210 to present its frame buffer on the display is done executing. The notification maintains synchronization where an application runs with vertical blank (vblank) synchronization.

Referring now to FIG. 7, an exemplary set of render and display operations, in accordance with another embodiment of the present technology, is shown. The rendering and copy operations executed on the second GPU 215 may be performed by different engines. Therefore, the rendering and copy operations may be performed substantially simultaneously in the second GPU 215.

Generally, the second GPU 215 is coupled to the system memory 220 by a bus having a relatively high bandwidth. However, in some systems the bus coupling the second GPU 215 may not provide sufficient bandwidth for blitting the frame buffer 216 of the second GPU 215 to system memory 220. For example, an application may be rendered at a resolution of 1280×1024 pixels. Therefore, approximately 5 MB/frame of RGB data is rendered. If the application renders at 100 frames/s, then the second GPU needs approximately 500 MB/s for blitting upstream to the system memory 220. However, a Peripheral Component Interconnect Express (PCIe) 1× bus typically used to couple the second GPU 215 to system memory 220 has a bandwidth of approximately 250 MB/s in each direction. Referring now to FIG. 8, a method of compressing rendered data, in accordance with one embodiment of the present technology, is shown. The second GPU 215 renders frames of RGB data, at 810. At 820, the frames of RGB data are converted using a pixel shader in the second GPU 215 to YUV sub-sample data. The RGB data is processed as texture data by the pixel shader in three passes to generate YUV sub-sample data. In one implementation, the U and V components are sub-sampled spatially; however, the Y component is not sub-sampled. The RGB data may be converted to YUV data using the 4:2:0 color space conversion algorithm. At 830, the YUV sub-sample data is blitted to the corresponding buffers in the system memory with an asynchronous copy engine of the second GPU. The YUV sub-sample data is blitted from the system memory to buffers of the first GPU, at 840. The YUV data is blitted to corresponding texture buffers in the first GPU. The Y, U, and V sub-sample data are buffered in three corresponding buffers, and therefore the copy from the frame buffer of the second GPU 215 to the system memory 220 and the copy from system memory 220 to the texture buffers of the first GPU 210 are each implemented by sets of three copies. 
The YUV sub-sample data is converted using a pixel shader in the first GPU 210 to recreate the RGB frame data, at 850. The device driver interface on the attached adapter is programmed to render a full screen aligned quad from the corresponding texture buffers holding the YUV data. At 860, the recreated RGB frame data is then presented on the primary display 240 by the first GPU 210. Accordingly, the shaders are utilized to provide YUV compression and decompression.
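The bandwidth figures in the example above can be checked with a short calculation. The Python sketch below makes two assumptions that the description does not state: RGB frames are stored at 4 bytes per pixel (which reproduces the approximate 5 MB/frame figure), and the 4:2:0 data uses 8-bit samples with U and V each sub-sampled by two horizontally and vertically. The function names and the decimal-megabyte PCIe constant are illustrative.

```python
PCIE_1X_BYTES_PER_S = 250_000_000  # approximate per-direction bandwidth from the text


def rgb_frame_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Size of one RGB frame; 4 bytes per pixel is an assumption."""
    return width * height * bytes_per_pixel


def yuv420_frame_bytes(width: int, height: int) -> int:
    """Size of one 4:2:0 frame: full-resolution Y plus quarter-resolution U and V."""
    y_plane = width * height
    chroma_plane = (width // 2) * (height // 2)
    return y_plane + 2 * chroma_plane


def required_bandwidth(frame_bytes: int, frames_per_s: int) -> int:
    """Bytes per second needed to blit every frame upstream."""
    return frame_bytes * frames_per_s


rgb_rate = required_bandwidth(rgb_frame_bytes(1280, 1024), 100)     # 524,288,000 B/s
yuv_rate = required_bandwidth(yuv420_frame_bytes(1280, 1024), 100)  # 196,608,000 B/s
```

Under these assumptions the uncompressed RGB stream needs roughly twice what the bus provides, while the 4:2:0 stream is 3/8 of the RGB size and fits within the approximate 250 MB/s limit.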

In one implementation, each buffer of Y, U and V samples is double buffered in the frame buffer of the second GPU 215 and the system memory 220. In addition, the Y, U and V samples copied into the first GPU 210 are double buffered as textures. In another implementation, the Y, U and V sample buffers in the second GPU 215 and corresponding texture buffers in the first GPU 210 are each double buffered. The Y, U and V samples buffered in the system memory 220 may each be triple buffered.

In one implementation, the shim layer 125 tracks the bandwidth needed for blitting and the efficiency of transfers on the bus to decide whether to enable the compression. In another implementation, the shim layer 125 enables or disables the YUV compression based on the type of application. For example, the shim layer 125 may enable compression for game applications but not for technical applications such as a Computer Aided Drawing (CAD) application.

In one embodiment, the white list accessed by the shim layer 125 to determine if graphics requests should be executed on the first GPU 210 or the second GPU 215 is loaded and updated by a vendor and/or system administrator. In another embodiment, a graphical user interface can be provided to allow the user to specify the use of the second GPU (e.g., discrete GPU) 215 for rendering a given application. The user may right click on the icon for the given application. In response to the user selection, a graphical user interface may be generated that allows the user to specify the second GPU for use when rendering images for the given application. In one implementation, the operating system is programmed to populate the graphical interface with a choice to run the given application on the GPU on the unattached adapter. A routine (e.g., dynamic linked library) registered to handle this context menu item will scan the shortcut link to the application, gather up the options and arguments, and then call an application launcher that will spawn a process to launch the application as well as setting an environment variable that will be read by the shim layer 125. In response, the shim layer 125 will run the graphics context for the given application on the second GPU 215. Therefore, the user can override, update, or the like, the white list loaded on the computing device.
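The launcher step can be sketched in Python as follows. The environment variable name COPROC_RENDER_GPU and its value "second" are hypothetical; the description only says that the launcher sets an environment variable that the shim layer later reads. The sketch uses the standard subprocess module to spawn the application with the adjusted environment.

```python
import os
import subprocess
from typing import List, Mapping

# Hypothetical variable name; the description does not name the variable.
GPU_ENV_VAR = "COPROC_RENDER_GPU"


def build_launch_env(base_env: Mapping[str, str], use_second_gpu: bool) -> dict:
    """Return a copy of the environment with the GPU selection applied."""
    env = dict(base_env)
    if use_second_gpu:
        env[GPU_ENV_VAR] = "second"
    else:
        # Make sure a stale selection does not leak into the child process.
        env.pop(GPU_ENV_VAR, None)
    return env


def launch(argv: List[str], use_second_gpu: bool) -> subprocess.Popen:
    """Spawn the application with the options and arguments gathered from its shortcut."""
    return subprocess.Popen(argv, env=build_launch_env(os.environ, use_second_gpu))


def shim_wants_second_gpu(env: Mapping[str, str]) -> bool:
    """What a shim layer might check at start-up (illustrative only)."""
    return env.get(GPU_ENV_VAR) == "second"
```

Building the child's environment as a copy keeps the launcher's own environment unchanged, so a later launch of a different application is not affected by an earlier selection.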

Referring now to FIG. 9, an exemplary desktop 910 including an exemplary graphical user interface for selection of the GPU to run a given application on is shown. The desktop includes icons 920-950 for one or more applications. When the user right clicks on a given application 930, a pull-down menu 970 is generated. The pull-down menu 970 is populated with an additional item of ‘run on dGPU’ or the like. The menu item for the second GPU 215 may provide for product branding by identifying the manufacturer and/or model of the second GPU. If the user selects the ‘run’ item or double left clicks on the icon, the graphics requests from the given application will run on the GPU on the primary adapter (e.g., the default iGPU) 210. If the user selects the ‘run on dGPU’ item, the graphics requests from the given application will run on the GPU on the unattached adapter (e.g., dGPU) 215.

In another implementation, the second graphics processing unit may support a set of rendering application programming interfaces and the first graphics processing unit may support a limited subset of the same application programming interfaces. Each application programming interface is implemented by a different runtime API 120 and a matching driver interface 130. Referring now to FIG. 10, a graphics co-processing technique, in accordance with another embodiment of the present technology, is shown. The runtime API 120 loads a shim layer 125 that will support all device driver interfaces. The shim layer 125 loads and configures the DDI 130 for the first GPU 210, using a device driver interface that the first GPU 210 supports on the primary adapter, and the DDI 135 for the second GPU 215, using a second device driver interface that can talk with the runtime API 120. For example, in one implementation, the second GPU 215 may be a DirectX10 class device and the first GPU 210 may be a DirectX9 class device that does not support DirectX10. The shim layer 125 appears to the DDI 130 for the first GPU 210 as a first application programming class runtime API (e.g., D3D9.dll), translates commands between the two device driver interface classes and may also convert between display formats.

The shim layer 125 includes a translation layer 126 that translates calls between the runtime API 120 device driver interface and the device driver interface class. In one implementation, the shim layer 125 translates display commands between the DirectX10 runtime API 120 and the DirectX9 DDI on the primary adapter 130. The shim layer, therefore, creates a Dx9 compatible context on the first GPU 210, which is the recipient of frames rendered by the Dx10 class second GPU 215. The shim layer 125 advantageously splits graphics commands into rendering and display commands, redirects the rendering commands to the DDI on the unattached adapter 135 and the display commands to the DDI on the primary adapter 130. The shim layer also translates between the commands for the Dx9 DDI on the primary adapter 130, the Dx10 DDI on the unattached adapter 135, the Dx10 runtime API 120 and Dx10 thunk layer 140, and provides for format conversion if necessary. The shim layer 125, in one implementation, intercepts commands from the Dx10 runtime 120 and translates these into commands for the Dx9 DDI on the primary adapter (e.g., iUMD.dll). The commands may include: CreateResource; OpenResource; DestroyResource; DxgiPresent, which triggers the surface transfer mechanism that ends up with the surface displayed on the iGPU; DxgiRotateResourceIdentities; DxgiBlt, for which present blits are translated; and DxgiSetDisplayMode.
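
The splitting and redirection of commands described above may be illustrated by the following sketch. It is not the shim layer's actual logic: the function `route`, the target labels, and the classification of DxgiBlt and DxgiSetDisplayMode as display commands are assumptions made for illustration. The three-step expansion of a present follows the split summarized in the abstract.

```python
# Assumed classification of display-side commands (illustrative only).
DISPLAY_COMMANDS = {"DxgiPresent", "DxgiBlt", "DxgiSetDisplayMode"}


def route(command):
    """Return the (target DDI, operation) steps for one graphics command.

    "unattached" stands for the DDI 135 of the second GPU 215 and
    "primary" stands for the DDI 130 of the first GPU 210.
    """
    if command == "DxgiPresent":
        # A present is split into two copies and a present.
        return [
            ("unattached", "copy frame buffer to system memory"),
            ("primary", "copy system memory to frame buffer"),
            ("primary", "present"),
        ]
    if command in DISPLAY_COMMANDS:
        return [("primary", command)]
    # Everything else is treated as a rendering command.
    return [("unattached", command)]
```

In this sketch a rendering command such as a hypothetical "Draw" is sent unchanged to the unattached adapter, while a present fans out into three ordered steps.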

The Dx9 DDI 130 for the first GPU 210 cannot call back directly through the runtime 120 to talk to a graphics adapter handled by an OS specific kernel mode driver because the runtime 120 expects the call to come from a Dx10 device. The shim layer 125 intercepts callbacks from the Dx9 DDI and exchanges device handles before forwarding the callback to the Dx10 runtime API 120. Dx10 and Dx11 runtime APIs 120 use a layer for presentation called DXGI, which has its own present callback that does not exist in the Dx9 callback interface. Therefore, when the display side DDI on the primary adapter calls the present callback, the shim layer translates it to a DXGI callback. For example:

PFND3DDDI_PRESENTCB -> PFNDDXGIDDI_PRESENTCB
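
The handle exchange and callback translation described above may be modeled, purely for illustration, as follows. The class `CallbackShim`, its handle table, and the callable standing in for the DXGI present callback are hypothetical; the sketch does not reproduce the actual driver callback signatures.

```python
class CallbackShim:
    """Forwards a Dx9-style present callback to a DXGI-style callback.

    handle_map is an assumed table from the device handle known to the
    Dx9 DDI to the device handle that the Dx10 runtime expects.
    """

    def __init__(self, handle_map, dxgi_present_cb):
        self.handle_map = dict(handle_map)
        self.dxgi_present_cb = dxgi_present_cb

    def present_cb(self, dx9_device_handle, args):
        # Exchange the device handle so the runtime sees a Dx10 device.
        dx10_device_handle = self.handle_map[dx9_device_handle]
        # Forward to the DXGI present callback in place of the Dx9 one.
        return self.dxgi_present_cb(dx10_device_handle, args)
```

The Dx9 DDI would be handed `present_cb` as its present callback, so that it never calls the Dx10 runtime directly.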

The shim layer 125 may also include a data structure 127 for converting display formats between the first graphics processing unit DDI and the second graphics processing unit DDI. For example, the shim layer 125 may include a lookup table to convert a 10 bit rendering format in Dx10 to an 8 bit format supported by the Dx9 class integrated GPU 210. The rendered frame may be copied to a staging surface, and a two-dimensional (2D) engine of the discrete GPU 215 utilizes the lookup table to convert the rendered frame to a Dx9 format. The Dx9 format frame is then copied to the frame buffer of the integrated GPU 210 and then presented on the primary display 240. For example, the following format conversions may be performed:

DXGI_FORMAT_R16G16B16A16_FLOAT (render) -> D3DDDIFMT_A8R8G8B8 (display)

DXGI_FORMAT_R10G10B10A2_UNORM (render) -> D3DDDIFMT_A8R8G8B8 (display)

In one implementation, the copying and conversion can happen as an atomic operation.
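
A lookup table based conversion from a 10 bit per channel format to an 8 bit per channel format may be sketched as follows. This is an illustration under stated assumptions and not the conversion performed by the 2D engine: the table contents (rounding to the nearest 8 bit value) and the function name are hypothetical. The bit layouts assumed are red in the low 10 bits and alpha in the top 2 bits for R10G10B10A2, and alpha in the top byte followed by red, green and blue for A8R8G8B8.

```python
# 1024-entry table: 10 bit channel value -> nearest 8 bit value (assumed).
LUT_10_TO_8 = [(v * 255 + 511) // 1023 for v in range(1024)]
# 4-entry table: 2 bit alpha -> 8 bit alpha (0, 85, 170, 255).
LUT_2_TO_8 = [v * 85 for v in range(4)]


def r10g10b10a2_to_a8r8g8b8(pixel):
    """Convert one 32 bit R10G10B10A2 pixel to a 32 bit A8R8G8B8 pixel."""
    r = LUT_10_TO_8[pixel & 0x3FF]
    g = LUT_10_TO_8[(pixel >> 10) & 0x3FF]
    b = LUT_10_TO_8[(pixel >> 20) & 0x3FF]
    a = LUT_2_TO_8[(pixel >> 30) & 0x3]
    return (a << 24) | (r << 16) | (g << 8) | b
```

Precomputing the tables keeps the per-pixel work to four table reads and a few shifts, which is the kind of operation a fixed-function copy engine can apply while the frame is being copied.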

The foregoing descriptions of specific embodiments of the present technology have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the present technology and its practical application, to thereby enable others skilled in the art to best utilize the present technology and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

What is claimed is:
 1. A method comprising: injecting an application initialization routine, when an application starts, that includes an entry point that changes a search path for a display device interface to a search path of a shim layer library, and that includes an entry point that identifies the application; loading the shim layer library, at the changed search path, that initializes a display device interface for a first graphics processing unit on a primary adapter and a display device interface for a second graphics processing unit on an unattached adapter, wherein the display device interface on the unattached adapter is initialized without calling back to a runtime application programming interface, and that determines if the application has an affinity for execution of graphics commands on the second graphics processing unit; splitting a display command, by the shim layer library, into a copy from a frame buffer of the second graphics processing unit to a buffer in system memory, a copy from the buffer in system memory to a frame buffer of the first graphics processing unit, and a present from the frame buffer of the first graphics processing unit on a display if there is an affinity.
 2. The method according to claim 1, wherein the display device driver interface on the primary adapter comprises a user mode driver dynamic linked library (UMD.dll) for the first graphic processing unit.
 3. The method according to claim 1, wherein the display device driver interface on the unattached adapter comprises a user mode driver dynamic linked library (UMD.dll) for the second graphic processing unit.
 4. The method according to claim 1, wherein the application initialization routine comprises a dynamic linked library.
 5. The method according to claim 1, wherein the runtime application programming interface comprises a DirectX application programming interface (D3Dx.dll).
 6. The method according to claim 1, wherein the first graphics processing unit comprises an integrated graphics processing unit.
 7. The method according to claim 1, wherein the second graphics processing unit comprises a discrete graphics processing unit.
 8. The method according to claim 1, wherein the first graphics processing unit and the second graphics processing unit are heterogeneous graphics processing units.
 9. The method according to claim 1, wherein the shim layer library determines if the application has an affinity if the application is on a white list.
 10. The method according to claim 9, wherein the white list includes the identifier of one or more applications that perform better on the second graphics processing unit than the first graphics processing unit.
 11. One or more computing device readable media having computing device executable instructions which when executed perform a method comprising: receiving a display operation for execution by a graphics processing unit on an unattached adapter; splitting the display operation into a copy from a frame buffer of the graphics processing unit on the unattached adapter to a buffer in system memory, a copy from the buffer in system memory to a frame buffer of graphics processing unit on a primary adapter, and a present from the frame buffer of the graphics processing unit on the primary adapter to a display; and synchronizing execution of the copy from the frame buffer of the graphics processing unit on the unattached adapter to the buffer in system memory and the copy from the buffer in system memory to the frame buffer of the graphics processing unit on the primary adapter.
 12. The one or more computing device readable media having computing device executable instructions which when executed perform the method of claim 11, wherein the frame buffer of the graphics processing unit on the unattached adapter is double buffered.
 13. The one or more computing device readable media having computing device executable instructions which when executed perform the method of claim 11, wherein the frame buffer of the graphics processing unit on the primary adapter is double buffered.
 14. The one or more computing device readable media having computing device executable instructions which when executed perform the method of claim 11, wherein the system memory is double buffered.
 15. The one or more computing device readable media having computing device executable instructions which when executed perform the method of claim 11, wherein the system memory is ring buffered.
 16. The one or more computing device readable media having computing device executable instructions which when executed perform the method of claim 11, wherein the graphics processing unit on the primary adapter and the graphics processing unit on the unattached adapter are heterogeneous graphics processing units.
 17. One or more computing device readable media having computing device executable instructions which when executed perform a method comprising: loading a device specific kernel mode driver of a second graphics processing unit tagged as a non-graphics device; loading and initializing a device driver interface and a device specific kernel mode driver for a first graphics processing unit on a primary adapter; and loading and initializing a device driver interface for the second graphics processing unit on a non-graphics device tagged adapter without the device driver interface talking back to a runtime application programming interface when a particular version of an operating system will not allow the device specific kernel mode driver for the second graphics processing unit to be loaded; splitting a display command into a copy from a frame buffer of the second graphics processing unit to a buffer in system memory, a copy from the buffer in system memory to a frame buffer of the first graphics processing unit, and a present from the frame buffer of the first graphics processing unit on a display on the primary adapter; calling the display device interface on the unattached adapter to configure command buffers to copy from the frame buffer of the second graphics processing unit to the buffer in the system memory, when the graphics command comprises a display command; calling the operating system kernel mode driver to schedule execution of the command buffers for the copy from the frame buffer of the second graphics processing unit to the buffer in system memory, when the graphics command comprises a display command; calling the device specific kernel mode driver to set command registers of the second graphics processing unit to copy from the frame buffer of the second graphics processing unit to the buffer in system memory, when the graphics command comprises a display command; calling the display device interface on the primary adapter to configure command buffers to copy from the buffer in system memory to a frame buffer of the first graphics processing unit, when the graphics command comprises a display command; calling the operating system kernel mode driver to schedule execution of the copy from the buffer in system memory to the frame buffer of the first graphics processing unit, when the graphics command comprises a display command; calling the device specific kernel mode driver to set command registers of the first graphics processing unit for the copy from the buffer in system memory to the frame buffer of the first graphics processing unit, when the graphics command comprises a display command; calling the display device interface on the unattached adapter to configure command buffers to present from the frame buffer of the first graphics processing unit, when the graphics command comprises a display command; calling the operating system kernel mode driver to schedule execution of the present command, when the graphics command comprises a display command; and calling to set command registers of the first graphics processing unit to present, when the graphics command comprises a display command.
 18. The one or more computing device readable media having computing device executable instructions which when executed perform the method of claim 17, wherein the copy from the frame buffer of the second graphics processing unit to the system memory and the copy from the system memory to a frame buffer of the first graphics processing unit are direct memory accesses.
 19. The one or more computing device readable media having computing device executable instructions which when executed perform the method of claim 17, further comprising synchronizing sequential execution of the copy from the frame buffer of the second graphics processing unit to the system memory and the copy from the system memory to a frame buffer of the first graphics processing unit.
 20. The one or more computing device readable media having computing device executable instructions which when executed perform the method of claim 19, wherein synchronizing sequential execution of the copy from the frame buffer of the second graphics processing unit to the system memory and the copy from the system memory to a frame buffer of the first graphics processing unit comprises: receiving notification when the copy from the frame buffer of the second graphics processing unit to the system memory is done, in a separate thread from a thread for the render and display commands; and queuing calling the display device interface on the primary adapter to configure command buffers to copy from the system memory to the frame buffer of the first graphics processing unit and calling the display device interface on the unattached adapter to configure command buffers to present after receiving notification when the copy from the frame buffer of the second graphics processing unit to the system memory is done.